Development of machine learning-based predictors for early diagnosis of hepatocellular carcinoma

Hepatocellular carcinoma (HCC) remains a formidable malignancy that significantly impacts human health, and the early diagnosis of HCC holds paramount importance. Therefore, it is imperative to develop an efficacious signature for the early diagnosis of HCC. In this study, we aimed to develop early HCC predictors (eHCC-pred) using machine learning-based methods and compare their performance with existing methods. The enhancements and advancements of eHCC-pred encompassed the following: (i) utilization of a substantial number of samples, including an increased representation of cirrhosis tissues without HCC (CwoHCC) samples for model training and augmented numbers of HCC and CwoHCC samples for model validation; (ii) incorporation of two feature selection methods, namely minimum redundancy maximum relevance and maximum relevance maximum distance, along with the inclusion of eight machine learning-based methods; (iii) improvement in the accuracy of early HCC identification, elevating it from 78.15 to 97% using identical independent datasets; and (iv) establishment of a user-friendly web server. The eHCC-pred is freely accessible at http://www.dulab.com.cn/eHCC-pred/. Our approach, eHCC-pred, is anticipated to be robustly employed at the individual level for facilitating early HCC diagnosis in clinical practice, surpassing currently available state-of-the-art techniques.

gastric carcinoma 15, colorectal carcinoma 16, pancreatic ductal adenocarcinoma 10,11,17 and so on. Thus, it is credible to identify a REOs-based transcriptional signature for early diagnosis of HCC. Nevertheless, it is not yet possible to implement the existing gene signatures in clinical practice, even though they have a powerful diagnostic ability for early HCC. This is partially because these signatures were obtained from gene expression profiling data, which may not accurately reflect changes in plasma proteins [18][19][20]. Since secreted genes can be translated into secreted proteins, which can possibly be used as tumor microenvironment or plasma signatures, we employed secreted genes for feature filtering.
Motivated by the establishment of various REOs-based diagnostic signatures to aid clinical HCC diagnosis decisions, we designed robust and powerful predictors in this work. The developed predictors hybridized several algorithms, i.e., REOs, mRMR 21, MRMD 22, support vector machine (SVM) 23,24, k-nearest neighbor (KNN) 24, decision tree (DT) 25,26, logistic regression (LR) 26, extreme gradient boosting (XGBoost) 24, logistic model trees (LMT) 27, adaptive boosting M1 (AdaBoostM1) 28 and naïve Bayes (NB) 29. The REOs method was used for feature construction, mRMR and MRMD were used for feature ranking and selection, 2902 secreted genes (genes encoding secreted proteins) collected from a public database were used for feature filtering, and the SVM, KNN, DT, LR, XGBoost, LMT, AdaBoostM1 and NB algorithms were used for classification. Among the sixteen predictors, nine (mRMR + KNN, mRMR + SVM, mRMR + LR, mRMR + XGBoost, mRMR + LMT, MRMD + KNN, MRMD + SVM, MRMD + LR and MRMD + LMT) showed excellent results for all performance metrics in the training set, each reaching an accuracy of 1, an F1-score of 1 and an AUC of 1. In the validation datasets, the mRMR + SVM predictor, which used the fewest gene pairs (11; AUC = 0.9384), and the MRMD + SVM predictor with 28 gene pairs (AUC = 0.9278) had the highest AUC values among these nine predictors, and they remained powerful predictors for HCC diagnosis even when the sampling location was not accurate. Moreover, the mRMR + SVM and MRMD + SVM predictors had a cross-platform effect and could be employed to diagnose early HCC at the individual level. In addition, comparison results demonstrated that the hybrid predictors mRMR + SVM and MRMD + SVM performed much better than Ao's method 14 and our previous work 13. Importantly, a user-friendly web server was established, which can be freely accessed at http://www.dulab.com.cn/eHCC-pred/ to aid early HCC diagnosis in clinical practice.

Derivation of HCC predictors
The whole analysis procedure is shown in Fig. 1. In the present study, we used the two feature selection methods and eight classification algorithms mentioned above to build sixteen predictors for HCC diagnosis, using gene expression profiles of 988 HCC and 332 CwoHCC samples accessed from the GEO database. First, on the basis of these profiles, 25,341,086 and 20,559,429 stable gene pairs were acquired for HCC and CwoHCC, respectively. Among them, there were 5765 stable reversal gene pairs between HCC tissues and CwoHCC tissues. Then, filtering gene pairs using the 2902 secreted genes, we obtained 242 gene pairs in which both gene i and gene j were secreted genes. Next, based on the new profiles with 242 features (gene pairs) (see "Methods" section), we captured the optimal feature subset (see Fig. 2). Table 1 shows the comparison of the classification performance of the various predictors based on accuracy, the F1-score fitness function and AUC value. The results in Table 1 illustrated that nine predictors, including mRMR + KNN, mRMR + SVM, mRMR + LR, mRMR + XGBoost, mRMR + LMT, MRMD + KNN, MRMD + SVM, MRMD + LR and MRMD + LMT, showed excellent results for all performance metrics, each reaching an accuracy of 1, an F1-score of 1 and an AUC of 1. Among these nine predictors, mRMR + KNN and mRMR + SVM used the fewest gene pairs (11; see Table 2).

Validation of HCC predictors
Subsequently, we used independent datasets (including the testing set, GEO sets, ICGC set and TCGA set) to validate the performance of the various algorithms. In Table 3, for the 3057 HCC samples and 84 CwoHCC samples, the MRMD + SVM predictor with 28 gene pairs (see Table S3) achieved the highest accuracy and F1-score among all predictors in the independent datasets; its accuracy, F1-score and AUC were 0.9834, 0.9915 and 0.9278 (95% CI 0.8915-0.9642), respectively. However, the results also indicated that the mRMR + SVM predictor with 11 gene pairs achieved the highest AUC among all predictors in the independent datasets, with an AUC of 0.9384 (95% CI 0.9255-0.9514).
Since the mRMR + SVM and mRMR + KNN predictors, which used the fewest gene pairs (11), showed great results for all performance metrics in independent data, and the MRMD + SVM predictor achieved the highest accuracy and F1-score in independent datasets among the 16 predictors, we focused on these three predictors in the subsequent analysis. The detailed validation results of these three predictors in biopsy and surgery samples are shown in Table 4. For biopsy samples, both the mRMR + SVM and mRMR + KNN predictors yielded a sensitivity of 1 and a specificity of 1 on the testing set (29 HCC samples and 48 CwoHCC samples), while the MRMD + SVM predictor yielded a sensitivity of 1 and a specificity of 0.8542. In the GEO biopsy sets, the mRMR + SVM predictor correctly classified 96.18% of the 131 HCC samples (GSE121248, GSE47197), the mRMR + KNN predictor correctly classified 66.41% of them, and the MRMD + SVM predictor correctly classified all (100%) of them. For surgery samples, in the testing set (220 HCC samples and 36 CwoHCC samples), the sensitivity and specificity of two predictors (mRMR + SVM and mRMR + KNN) were both 1, whereas the sensitivity and specificity of the MRMD + SVM predictor were 1 and 0.8889, respectively. This result demonstrated that the mRMR + SVM, mRMR + KNN and MRMD + SVM predictors could correctly discriminate HCC from CwoHCC when using biopsy samples.
For surgery samples, in the GEO surgery sets, 84.1% of the 2063 HCC samples were correctly classified by the mRMR + SVM predictor, 70.04% by the mRMR + KNN predictor and 98.01% by the MRMD + SVM predictor. Moreover, among the 2063 HCC samples, the mRMR + SVM predictor correctly recognized 79.76% of the 657 formalin-fixed paraffin-embedded (FFPE) HCC samples (GSE109211, GSE62743, GSE46444, GSE10141, GSE164760, GSE19977) as HCC, while the mRMR + KNN predictor correctly classified 58.14% and the MRMD + SVM predictor 99.85% of the 657 FFPE HCC samples. This result demonstrated that the mRMR + SVM and mRMR + KNN predictors were applicable to FFPE samples with RNA degradation. For the RNA-seq expression data obtained from TCGA and ICGC, the 11 gene pairs of the mRMR + SVM predictor correctly identified 99.19% of the 371 HCC samples and 98.77% of the 243 HCC samples, respectively.
Meanwhile, the 11 gene pairs of the mRMR + KNN predictor correctly identified 98.11% of the 371 HCC RNA-seq samples and 97.94% of the 243 HCC RNA-seq samples, and the MRMD + SVM predictor with 28 gene pairs correctly identified all 371 and all 243 HCC RNA-seq samples. This result demonstrated that the mRMR + SVM, mRMR + KNN and MRMD + SVM predictors had a cross-platform ability. In summary, these three predictors had a cross-platform ability and could discriminate HCC from CwoHCC when using surgery samples, including FFPE samples with RNA degradation.
Furthermore, in Table S4, 82.86% of the 741 normal tissue samples from patients with HCC (NwHCC) and 82.04% of the 334 cirrhosis tissue samples from patients with HCC (CwHCC) were correctly classified by the mRMR + SVM predictor; 67.48% of the 741 NwHCC samples and 57.49% of the 334 CwHCC samples were correctly classified by the mRMR + KNN predictor; and 99.87% of the 741 NwHCC samples and 97.01% of the 334 CwHCC samples were correctly classified by the MRMD + SVM predictor. This result showed that these three predictors could distinguish HCC-adjacent tissues (CwHCC and NwHCC) from CwoHCC when using biopsy and surgery samples.
In conclusion, for biopsy and surgery samples, these three predictors could distinguish HCC and its adjacent tissues (CwHCC and NwHCC) from CwoHCC even when the sampling location is not accurate and the samples are FFPE samples with RNA degradation. Additionally, these three predictors had a cross-platform ability. Importantly, the HCC diagnostic signature based on MRMD + SVM outperformed the mRMR + KNN and mRMR + SVM predictors in some independent datasets.

Comparison with previous predictors
To further verify the performance of the mRMR + SVM, mRMR + KNN and MRMD + SVM predictors developed in the current study, we compared them with existing predictors. Two studies on REOs-based signatures for early HCC diagnosis have been published: one by Ao et al. and our previous work. In 2018, combining rank difference with a majority voting rule, Ao et al. presented a signature using 491 HCC samples and 149 CwoHCC samples. This signature, comprising 19 gene pairs, was chosen from 72 reversal gene pairs and yielded an accuracy of 0.9969. In 2020, we identified an early diagnostic signature of HCC from 857 reversal gene pairs on the basis of mRMR and SVM. Using 1091 HCC samples and 242 CwoHCC samples, 11 gene pairs were derived and denoted as the signature, which achieved an accuracy of 1. Because the training data differ, directly comparing the results in this paper with those reported in previous studies would be unfair. Therefore, we utilized the same evaluation criteria. To further assess the effectiveness of the presented predictors, experimental results on independent datasets were used to perform the comparison objectively.
In Table 2, for the training set, both the mRMR + SVM predictor and the mRMR + KNN predictor, each with 11 gene pairs, achieved an accuracy of 1 and an F1-score of 1 while using the fewest gene pairs. Also,  HCC samples and all 50 NwHCC tissues were correctly identified as HCC. In addition, 240 out of 243 HCC samples from TCGA were also correctly identified as HCC, while the MRMD + SVM predictor correctly identified all 243 HCC samples as HCC.
Results in Table S4 show the identification of both HCC and its adjacent non-cancerous tissues (NwHCC and CwHCC) from CwoHCC in biopsy and surgery samples. For the 131 HCC biopsy samples, the sensitivities of the proposed mRMR + SVM predictor with 11 gene pairs (18 secreted genes) and the MRMD + SVM predictor with 28 gene pairs were 0.7526 and 1, respectively, both higher than that of Ao's method (0.6031). The identification ability of the proposed mRMR + SVM predictor was also better than that of Ao's method in the 80 CwHCC samples. Additionally, among these methods, the mRMR + SVM and MRMD + SVM predictors displayed better classification in the 657 HCC FFPE samples, the 1800 HCC surgery samples (which include the 657 HCC FFPE samples) and all 1931 HCC samples (the 1800 HCC surgery samples plus the 131 HCC biopsy samples). For the 657 HCC FFPE samples, the accuracies of Ao's method, our previous method (11 gene pairs, 2020), the proposed mRMR + SVM predictor and the MRMD + SVM predictor were 0.172, 0.3973, 0.7976 and 0.9985, respectively. For the 1800 HCC samples, the accuracies of Ao's method, our previous method, the proposed mRMR + SVM predictor and the MRMD + SVM predictor were 0.6639, 0.7656, 0.8428 and 0.9872, respectively. For the 1931 HCC samples, the accuracy of Ao's method was 0.6572 and that of our previous method was 0.7815, while the accuracies of the proposed mRMR + SVM and MRMD + SVM predictors increased to 0.8503 and 0.97, respectively. These results suggested that the mRMR + SVM and MRMD + SVM predictors performed better than Ao's method and our previous method.
In conclusion, the methods developed in this paper produced higher accuracy and had superior prediction and diagnosis abilities compared with other published methods, especially for FFPE samples. Therefore, the mRMR + SVM and MRMD + SVM predictors were deemed superior and more suitable for facilitating early HCC diagnosis in clinical practice.

Conclusions
In this study, we developed eHCC-pred, a machine learning-based predictor for early diagnosis of HCC, using REOs and two feature selection methods (mRMR and MRMD). eHCC-pred comprises two machine learning predictors: the MRMD + SVM predictor and the mRMR + SVM predictor. In the training set consisting of 988 HCC samples and 332 CwoHCC samples, both predictors achieved perfect accuracy, F1-score and AUC values of 1. Subsequently, the performance of these predictors was evaluated on independent datasets comprising 3057 HCC samples and 84 CwoHCC samples. The mRMR + SVM predictor exhibited a higher AUC value (0.9384) than the MRMD + SVM predictor (AUC = 0.9278), while the latter attained the highest accuracy (0.9834) and F1-score (0.9915). Finally, we compared our results with previous methods in this field. It is important to note that the data preprocessing level of our previous 2020 method (involving 11 gene pairs) is equivalent to that of the current work. The accuracy of early HCC identification improved from 78.15% to 97% on identical independent datasets. Our approach, eHCC-pred (http://www.dulab.com.cn/eHCC-pred/), is expected to be robustly utilized at the individual level to facilitate early diagnosis of HCC in clinical practice, surpassing currently available state-of-the-art techniques.

Discussion
Highly accurate early diagnosis is crucial for patients with hepatocellular carcinoma. The current work developed and validated machine learning-based predictors to aid early HCC diagnosis in clinical practice. Among the sixteen predictors, the mRMR + SVM predictor comprising 11 gene pairs (18 secreted genes) and the MRMD + SVM predictor consisting of 28 gene pairs (34 secreted genes) exhibited superior predictive capability in validation datasets, thereby potentially enhancing the precision of decision-making during HCC diagnosis.

Feature construction method
REOs is a feature construction method that has been applied to acquire dependable and robust signatures from gene expression profiling. For a gene pair (gene i and gene j), Gi > Gj indicates that the expression of gene i is higher than that of gene j, and Gi < Gj indicates that the expression of gene i is lower than that of gene j. A stable gene pair is one whose pattern (Gi > Gj or Gi < Gj) is kept in at least 85% of samples. A stable gene pair that kept Gi > Gj in HCC tissues and Gi < Gj in CwoHCC tissues was denoted a reversal stable gene pair and was selected as a candidate for the REO-based qualitative diagnostic signature. After obtaining the reversal gene pairs between HCC and CwoHCC tissues, 2902 secreted genes were used to filter the gene pairs. Next, based on the reversal gene pairs and gene expression profiling, new profiles encoded by 0, 1 and −1 were generated, where 1 represented Gi < Gj, 0 represented Gi > Gj, and −1 represented other cases (Gi or Gj does not exist).
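The feature construction steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`stable_orderings`, `reversal_pairs`, `encode`), the toy expression values, and the choice to map ties to −1 are all assumptions made for illustration, and the brute-force loop over gene pairs is only practical for small inputs.

```python
from itertools import permutations

def stable_orderings(profiles, threshold=0.85):
    """Return ordered pairs (i, j) with gene i > gene j in at least
    `threshold` of the samples. `profiles` is a list of samples, each a
    list of expression values indexed by gene."""
    n_genes = len(profiles[0])
    stable = set()
    for i, j in permutations(range(n_genes), 2):
        hits = sum(1 for sample in profiles if sample[i] > sample[j])
        if hits / len(profiles) >= threshold:
            stable.add((i, j))
    return stable

def reversal_pairs(hcc, cwohcc, threshold=0.85):
    """Pairs (i, j) with Gi > Gj stable in HCC and Gi < Gj stable in CwoHCC."""
    hcc_stable = stable_orderings(hcc, threshold)
    cwo_stable = stable_orderings(cwohcc, threshold)
    return sorted((i, j) for (i, j) in hcc_stable if (j, i) in cwo_stable)

def encode(sample, pairs):
    """Encode one sample: 1 if Gi < Gj, 0 if Gi > Gj, -1 otherwise.
    A missing gene is given as None; ties also map to -1 (an assumption)."""
    codes = []
    for i, j in pairs:
        gi, gj = sample[i], sample[j]
        if gi is None or gj is None or gi == gj:
            codes.append(-1)
        else:
            codes.append(1 if gi < gj else 0)
    return codes

# Toy data: 3 genes, 4 samples per group (values are made up).
hcc = [[5, 1, 3], [6, 2, 3], [7, 3, 3], [8, 4, 3]]
cwohcc = [[1, 5, 3], [1, 6, 3], [2, 7, 3], [1, 8, 3]]
pairs = reversal_pairs(hcc, cwohcc)
print(pairs)
print(encode([2, 5, None], pairs))
```

In this sketch, filtering by secreted genes would amount to keeping only the pairs in which both indices belong to a given gene set.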

Feature selection method and incremental feature selection
To pick out valid gene pairs for HCC diagnosis, the mRMR 21 and MRMD 22 algorithms were applied for feature selection. Here, a gene pair was considered as a feature. The principle of the mRMR algorithm is simple: find the features with maximum relevance while removing redundant ones, which is equivalent to obtaining the "purest" feature subset (features that differ greatly from each other and are also highly correlated with the target variable). It is based on information theory and is computed from mutual information (MI). In the formulation of MI and mRMR (see Eq. 1 for MI), f represents the feature vector, T represents the disease type, MI(f i , T) represents the MI between feature f i and class T, and MI(f i , f j ) represents the MI between f i and f j ; the formulation also involves the set of ranked features.
MRMD selects feature subsets that are strongly correlated with the class label and have low redundancy among features. The MRMD feature selection method is mainly determined by two parts. The first is the correlation between a feature and the class label, which MRMD calculates with the Pearson correlation coefficient: the larger the Pearson correlation coefficient, the closer the relationship between the feature and the class label. The second is the redundancy between features. Three distance functions (Euclidean distance, Cosine distance and Tanimoto coefficient) can be used to calculate the redundancy between features: the larger the distance, the lower the redundancy. More details about MRMD can be found in Zou's paper 22. In this study, Cosine distance was used.
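As a rough illustration of the mRMR idea, the sketch below ranks discrete features greedily. Two points are assumptions rather than details taken from this paper: MI is computed in its discrete (summation) form, which suits features encoded as 0, 1 and −1, and the selection criterion is the common "relevance minus mean redundancy" difference form. The names `mutual_information` and `mrmr_rank` and the toy data are hypothetical.

```python
from collections import Counter
from math import log

def mutual_information(x, y):
    """Discrete mutual information (in nats) between two equal-length sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )

def mrmr_rank(features, target):
    """Greedy mRMR ranking (difference form, an assumption).
    features: dict mapping a feature name to its list of discrete values.
    target: list of class labels, one per sample."""
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    ranked, remaining = [], list(features)

    def score(name):
        if not ranked:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[r]) for r in ranked
        ) / len(ranked)
        return relevance[name] - redundancy

    while remaining:
        best = max(remaining, key=score)  # ties keep the earlier feature
        ranked.append(best)
        remaining.remove(best)
    return ranked

# Toy data: "b" duplicates "a", so it is pushed to the end of the ranking.
target = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "a": [1, 1, 1, 0, 0, 0, 0, 0],
    "b": [1, 1, 1, 0, 0, 0, 0, 0],
    "c": [1, 0, 1, 0, 0, 1, 0, 0],
}
ranking = mrmr_rank(features, target)
print(ranking)
```

The toy example shows the intended behaviour: a feature that merely duplicates an already selected one is penalised for redundancy even though its relevance is high.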
Based on the new encoded profiles and the two feature selection methods, we obtained ranked lists of gene pairs. Subsequently, using the incremental feature selection (IFS) strategy 33, the optimal gene pairs, i.e. those producing the best diagnosis for HCC, were chosen from the 242 gene pairs ranked by mRMR and MRMD.
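The IFS strategy can be sketched as a simple loop over prefixes of the ranked list. The sketch below is illustrative only: `incremental_feature_selection` is a hypothetical helper, the `evaluate` callable stands in for "train a classifier on these features and return its accuracy", the toy scores are made up, and breaking ties in favour of the smaller subset is an assumption.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Return (best_subset, best_score, curve).
    ranked_features: feature names ordered from most to least important.
    evaluate: callable taking a list of features and returning a score
              such as accuracy."""
    best_k, best_score, curve = 0, float("-inf"), []
    for k in range(1, len(ranked_features) + 1):
        score = evaluate(ranked_features[:k])
        curve.append(score)
        if score > best_score:  # strict comparison: ties keep the smaller subset
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score, curve

# Hypothetical stand-in for model training: accuracy by subset size.
toy_accuracy = {1: 0.80, 2: 0.91, 3: 0.97, 4: 0.97, 5: 0.95}
subset, score, curve = incremental_feature_selection(
    ["p1", "p2", "p3", "p4", "p5"],
    lambda feats: toy_accuracy[len(feats)],
)
print(subset, score)
```

Plotting `curve` against the subset size gives an IFS curve of the kind described for Fig. 2.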

Classification through machine learning methods
Machine learning techniques including SVM, KNN, DT, LR, XGBoost, LMT, AdaBoostM1 and NB were adopted to establish diagnostic predictors of early HCC. XGBoost and NB were performed using the R packages "xgboost" and "naivebayes", respectively. For the XGBoost model, the parameters were nrounds = 25 and objective = "binary:logistic". The other six classification methods were performed using the R package "RWeka", with the functions SMO, IBk, J48, LR, LMT and AdaBoostM1. SMO provides a support vector classifier; here it used an RBF kernel with a non-default gamma parameter (argument '-G'), G = 2. IBk generates a k-nearest neighbors classifier, J48 provides unpruned or pruned C4.5 decision trees, LR produces a logistic regression model and LMT builds logistic model trees. The AdaBoost M1 method of Freund and Schapire is implemented by the AdaBoostM1 function, and decision stumps (trees with a single split only) were used as its base learners.

Performance evaluation of predictors
In the current study, we assessed the performance of our predictors on independent cohorts, comprising the testing set and other independent datasets (array and RNA-seq gene-expression data) obtained from GEO, ICGC and TCGA (see Table S1), none of which were used for training. Five widely used metrics were calculated to evaluate the diagnostic ability of the gene pair signature for early HCC: sensitivity, specificity, accuracy, F1-score and area under the receiver operating characteristic curve (AUC).
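For reference, the five metrics can be computed from predicted labels and scores as in the short Python sketch below. It uses the standard definitions; the function names and toy labels are illustrative, HCC is assumed to be the positive class (label 1), and the code does not guard against a class being absent.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Sensitivity, specificity, accuracy and F1-score from hard predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

def auc(y_true, scores, positive=1):
    """AUC as the probability that a positive sample scores higher than a
    negative one (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 4 HCC (1) and 2 CwoHCC (0) samples with made-up outputs.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6]
metrics = classification_metrics(y_true, y_pred)
print(metrics, auc(y_true, scores))
```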
$$\mathrm{MI}(f_i, T) = \iint p(f_i, T)\,\ln\frac{p(f_i, T)}{p(f_i)\,p(T)}\,\mathrm{d}f_i\,\mathrm{d}T \tag{1}$$

Figure 1. The workflow of analyses.

Figure 2. A plot to show the IFS curve. Through adding features (gene pairs) ranked by the mRMR and MRMD feature selection methods one by one, the optimal feature subset was obtained when the highest accuracy was achieved.

Table 1. Comparison of various predictors based on accuracy and F1-score fitness function with feature selection in the training set. NO.Opt, number of optimal signature; NO.HCC, number of HCC samples; NO.CwoHCC, number of CwoHCC samples; ACC, accuracy.

Table 2. The 11 gene pairs' signature ranked by mRMR. Gene i has a higher expression level than Gene j in HCC patients compared with CwoHCC patients.

Table 3. The performance of various predictors in independent datasets. NO.Opt, number of optimal signature; NO.HCC samples, number of HCC samples; NO.CwoHCC samples, number of CwoHCC samples; ACC, accuracy.

MRMD + SVM predictor with 28 gene pairs achieved accuracy of 1, F1-score of 1. As shown in Table 3, for a total of 3057 HCC samples and 84 CwoHCC samples, mRMR + SVM predictor was the best predictor, which yielded AUC of 0.9384, and its accuracy and F1-score were 0.8914 and 0.9351, respectively. In Table 4 and Table S4, for biopsy samples, based on the mRMR + SVM predictor, 96.18% of the 131 HCC samples from 2 datasets (GSE121248, GSE47197) could be correctly identified as HCC. Moreover, 75.26% of the 97 NwHCC samples from 2 datasets (GSE121248 and GSE64041) and all 80 CwHCC samples in GSE54236 were classified as HCC. While, based on MRMD + SVM predictor, all of 131 HCC samples could be correctly identified as HCC, all 97 NwHCC samples and all 80 CwHCC samples were classified as HCC. For surgery samples, 1800 HCC samples from 24 datasets were used to perform evaluation and 657 of them were FFPE HCC samples from 6 datasets.